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Abstract. The classic frequentist theory of hypothesis testing devel- 
oped by Neyman, Pearson and Fisher has a claim to being the twenti- 
eth century's most influential piece of applied mathematics. Something 
new is happening in the twenty-first century: high-throughput devices, 
such as microarrays, routinely require simultaneous hypothesis tests for 
thousands of individual cases, not at all what the classical theory had 
in mind. In these situations empirical Bayes information begins to force 
itself upon frequentists and Bayesians alike. The two-groups model is a 
simple Bayesian construction that facilitates empirical Bayes analysis. 
This article concerns the interplay of Bayesian and frequentist ideas 
in the two-groups setting, with particular attention focused on Ben- 
jamini and Hochberg's False Discovery Rate method. Topics include 
the choice and meaning of the null hypothesis in large-scale testing sit- 
uations, power considerations, the limitations of permutation methods, 
significance testing for groups of cases (such as pathways in microarray 
studies), correlation effects, multiple confidence intervals and Bayesian 
competitors to the two-groups model. 

Key words and phrases: Simultaneous tests, empirical null, false dis- 
covery rates. 



1. INTRODUCTION 

Simultaneous hypothesis testing was a lively re- 
search topic during my student days, exemplified by 
Rupert Miller's classic text "Simultaneous Statis- 
tical Inference" (1966, 1981). Attention focused on 
testing N null hypotheses at the same time, where 
N was typically less than half a dozen, though the 
requisite tables might go up to N = 20. Modern sci- 
entific technology, led by the microarray, has upped 
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the ante in dramatic fashion: my examples here will 
have iV's ranging from 200 to 10,000, while N = 
500,000, from SNP analyses, is waiting in the wings. 
[The astrostatistical applications in Liang et al. (2004) 
envision N = 10 10 and more!] 

Miller's text is relentlessly frequentist, reflecting 
a classic Neyman-Pearson testing framework, with 
the main goal being preservation of "a," overall test 
size, in the face of multiple inference. Most of the 
current microarray statistics literature shares this 
goal, and also its frequentist viewpoint, as described 
in the nice review article by Dudoit and Boldrick 
(2003). 

Something changes, though, when N gets big: with 
thousands of parallel inference problems to consider 
simultaneously, Bayesian considerations begin to force 
themselves even upon dedicated frequentists. The 
"two-groups model" of the title is a particularly sim- 
ple Bayesian framework for large-scale testing situa- 
tions. This article explores the interplay of frequen- 
tist and Bayesian ideas in the two-groups setting, 
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Fig. 1. Four examples of large-scale simultaneous inference, each panel indicating N z-values as explained in the text. Panel 
A, prostate cancer microarray study, N = 6033 genes; panel B, comparison of advantaged versus disadvantaged students passing 
mathematics competency tests, N — 3748 high schools; panel C, proteomics study, N = 230 ordered peaks in time-of -flight 
spectroscopy experiment; panel D, imaging study comparing dyslexic versus normal children, showing horizontal slice of 655 
voxels out of N = 15,455, coded "— " for Zi < 0, "+" for Zi>0 and solid circle for Zi>2. 
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with particular attention paid to False Discovery 
Rates (Benjamini and Hochberg, 1995). 

Figure 1 concerns four examples of large-scale si- 
multaneous hypothesis testing. Each example con- 
sists of N individual cases, with each case repre- 
sented by its own z-value "zj," for i= 1,2, . . . , N . 
The Zi's are based on familiar constructions that, 
theoretically, should yield standard N(0, 1) normal 
distributions under a classical null hypothesis, 

(1.1) theoretical nulkz* ~iV(0,l). 

Here is a brief description of the four examples, with 
further information following as needed in the se- 
quel. 

Example A [Prostate data, Singh et al. (2002)]. 
N = 6033 genes on 102 microarrays, n\ = 50 healthy 
males compared with = 52 prostate cancer pa- 
tients; Zi's based on two-sample t statistics compar- 
ing the two categories. 

Example B [Education data, Rogosa (2003)]. 
N = 3748 California high schools; z^s based on bi- 
nomial test of proportion advantaged versus pro- 
portion disadvantaged students passing mathemat- 
ics competency tests. 

Example C [Proteomics data, Turnbull (2006)]. 
N = 230 ordered peaks in time-of-flight spectroscopy 
study of 551 heart disease patients. Each peak's z- 
value was obtained from a Cox regression of the pa- 
tients' survival times, with the predictor variable be- 
ing the 551 observed intensities at that peak. 

Example D [Imaging data, Schwartzman et al. 
(2005]). N = 15,445 voxels in a diffusion tensor 
imaging (DTI) study comparing 6 dyslexic with six 
normal children; Zj's based on two-sample t statistics 
comparing the two groups. The figure shows only 
a single horizontal brain section having 655 voxels, 
with "— " indicating m < 0, "+" for Zi > 0, and solid 
circles for z% > 2. 

Our four examples are enough alike to be usefully 
analyzed by the two-groups model of Section 2, but 
there are some striking differences, too: the theo- 
retical iV(0, 1) null (1.1) is obviously inappropriate 
for the education data of panel B; there is a hint 
of correlation of z-value with peak number in panel 
C, especially near the right limit; and there is sub- 
stantial spatial correlation appearing in the imaging 
data of panel D. 

My plan here is to discuss a range of inference 
problems raised by large-scale hypothesis testing, 



many of which, it seems to me, have been more or 
less underemphasized in a literature focused on con- 
trolling Type-I errors: the choice of a null hypothe- 
sis, limitations of permutation methods, the mean- 
ing of "null" and "nonnull" in large-scale settings, 
questions of power, test of significance for groups of 
cases (e.g., pathways in microarray studies), the ef- 
fects of correlation, multiple confidence statements 
and Bayesian competitors to the two-groups model. 
The presentation is intended to be as nontechnical 
as possible, many of the topics being discussed more 
carefully in Efron (2004, 2005, 2006). References will 
be provided as we go along, but this is not intended 
as a comprehensive review. Microarrays have stim- 
ulated a burst of creativity from the statistics com- 
munity, and I apologize in advance for this article's 
concentration on my own point of view, which aims 
at minimizing the amount of statistical modeling 
required of the statistician. More model-intensive 
techniques, including fully Bayesian approaches, as 
in Parmigiani et al. (2002) or Lewin et al. (2006), 
have their own virtues, which I hope will emerge in 
the Discussion. 

Section 2 discusses the two-groups model and false 
discovery rates in an idealized Bayesian setting. Em- 
pirical Bayes methods are needed to carry out these 
ideas in practice, as discussed in Section 3. This dis- 
cussion assumes a "good" situation, like that of Ex- 
ample A, where the theoretical null (1.1) fits the 
data. When it does not, as in Example B, the em- 
pirical null methods of Section 4 come into play. 
These raise interpretive questions of their own, as 
mentioned above, discussed in the later sections. 

We are living through a scientific revolution pow- 
ered by the new generation of high-throughput ob- 
servational devices. This is a wonderful opportunity 
for statisticians, to redemonstrate our value to the 
scientific world, but also to rethink basic topics in 
statistical theory. Hypothesis testing is the topic 
here, a subject that needs a fresh look in contexts 
like those of Figure 1. 

2. THE TWO-GROUPS MODEL AND FALSE 
DISCOVERY RATES 

The two-groups model is too simple to have a sin- 
gle identifiable author, but it plays an important 
role in the Bayesian microarray literature, as in Lee 
et al. (2000), Newton et al. (2001) and Efron et al. 
(2001). We suppose that the N cases ("genes" as 
they will be called now in deference to microarray 
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studies, though they are not genes in the last three 
examples of Figure 1) are each either null or non- 
null with prior probability po or p\ = 1 — po, and 
with z- values having density either fo(z) or fi(z), 



(2.1) 



Po = Prjnull} fo( z ) density if null, 
pi = Prjnonnull} fi(z) density if nonnuil. 



The usual purpose of large-scale simultaneous test- 
ing is to reduce a vast set of possibilities to a much 
smaller set of scientifically interesting prospects. In 
Example A, for instance, the investigators were prob- 
ably searching for a few genes, or a few hundred at 
most, worthy of intensive study for prostate cancer 
etiology. I will assume 



(2.2) 



Po > 0.90 



in what follows, limiting the nonnuil genes to no 
more than 10%. 

False discovery rate (Fdr) methods have devel- 
oped in a strict frequentist framework, beginning 
with Benjamini and Hochberg's seminal 1995 paper, 
but they also have a convincing Bayesian rationale 
in terms of the two-groups model. Let Fq(z) and 
F\(z) denote the cumulative distribution functions 
(cdf) of fo(z) and fx(z) in (2.1), and define the mix- 
ture cdf F(z) =pqFq(z) -\-p\Fi(z). Then Bayes' rule 
yields the a posteriori probability of a gene being in 
the null group of (2.1) given that its z- value Z is 
less than some threshold z, say "Fdr(z)," as 



(2.3) 



Fdr(z) =Pr{null|Z< z} 
= pMz)/F(z). 



[Here it is notationally convenient to consider the 
negative end of the z scale, values like z = —3. Defi- 
nition (2.3) could just as well be changed to Z > z or 
Z > \z\.] Benjamini and Hochberg's (1995) false dis- 
covery rate control rule begins by estimating F(z) 
with the empirical cdf 



(2.4) 



F(z) = #{ Zi < z}/N, 



yielding Fdr(z) =pqFq(z)/F(z). The rule selects a 
control level "g," say q = 0.1, and then declares as 
nonnuil those genes having z- values z% satisfying Z{ < 
zq, where zq is the maximum value of z satisfying 



(2.5) 



Fdr(zo) < q 



[usually taking po = 1 in (2.3), and Fq the theoretical 
null, the standard normal cdf &(z) of (1.1)]. 



The striking theorem proved in the 1995 paper 
was that the expected proportion of null genes re- 
ported by a statistician following rule (2.5) will be no 
greater than q. This assumes independence among 
the £j's, extended later to various dependence mod- 
els in Benjamini and Yekutieli (2001). The theorem 
is a purely frequentist result, but as pointed out 
in Storey (2002) and Efron and Tibshirani (2002), 
it has a simple Bayesian interpretation via (2.3): 
rule (2.5) is essentially equivalent to declaring non- 
null those genes whose estimated tail-area posterior 
probability of being null is no greater than q. It is 
usually a good sign when Bayesian and frequentist 
ideas converge on a single methodology, as they do 
here. 

Densities are more natural than tail areas for Baye- 
sian fdr interpretation. Defining the mixture density 
from (2.1), 



(2.6) f(z) = 

Bayes' rule gives 

fdr(z) 



Pofo(z) +Pifi(z), 



(2.7) 



:Pr{null|Z = z} 

■Pofo(z)/f(z) 

for the probability of a gene being in the null group 
given z-score z. Here fdr(z) is the local false discov- 
ery rate (Efron et al., 2001; Efron, 2005). 

There is a simple relationship between Fdr(z) and 
fdr(z), 

(2.8) Fdr(z) = E f {idr(Z)\Z < z}, 

"Ef" indicating expectation with respect to the mix- 
ture density f{z). That is, Fdr(z) is the mixture 
average of fdr(Z) for Z < z. In the usual situation 
where fdr(z) decreases as \z\ gets large, Fdr(z) will 
be smaller than fdr(z). Intuitively, if we decide to la- 
bel all genes with zi less than some negative value zq 
as nonnuil, then fdr(zo), the false discovery rate at 
the boundary point zq, will be greater than Fdr(z ), 
the average false discovery rate beyond the bound- 
ary. Figure 2 illustrates the geometrical relationship 
between Fdr(z) and fdr(z); the Benjamini-Hochberg 
Fdr control rule amounts to an upper bound on the 
secant slope. 
For Lehmann alternatives 

(2.9) F 1 (z) = F (z)\ [ 7 <1], 

it turns out that 

fdr(z) 



log 



(2.10) 



1 - fdr(z) 

Fdr(z) 



log 



T-i i / x I +log 

1 — Fdr(z) J \7 
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Fig. 2. Relationship of Fdr(z) to fdr{z). Heavy curve plots numerator of Fdr, poFo(z), versus denominator F{z); fdr(z) is 
slope of tangent, Fdr slope of secant. 



so 



(2.11) 



fdr(z)=Fdr(z)/ 7 



for small values of Fdr. The prostate data of Fig- 
ure 1 has 7 about 1/2 in each tail, making fdr(z) ~ 
2 Fdr(z) near the extremes. 

The statistics literature has not reached consen- 
sus on the choice of q for the Benjamini-Hochberg 
control rule (2.5) — what would be the equivalent of 
0.05 for classical testing — but Bayes factor calcula- 
tions offer some insight. Efron (2005, 2006) uses the 
cutoff point 



(2.12) 



fdr(z) < 0.20 



for reporting nonnull genes, on the admittedly sub- 
jective grounds that fdr values much greater than 
0.20 are dangerously prone to wasting investigators' 
resources. Then (2.6), (2.7) yield posterior odds ra- 
tio 



(2.13) 



Pr{nonnull|2i}/ Pr{null|z} 
= (1 -fdr(z))/fdr(z) 

= Pifi(z)/Pofo{z) 
> 0.8/0.2 = 4. 



Since (2.2) implies pi/po < 1/9, (2.13) corresponds 
to requiring Bayes factor 



(2.14) 



/i(*)//o(*)>36 



in favor of nonnull in order to declare significance. 

Factor (2.14) requires much stronger evidence 
against the null hypothesis than in standard one- 
at-a-time testing, where the critical threshold lies 
somewhere near 3 (Efron and Gous, 2001). The fdr 
0.20 threshold corresponds to q-values in (2.5) be- 
tween 0.05 and 0.15 for moderate choices of 7; such 
g-value thresholds can be interpreted as providing 
conservative Bayes factors for Fdr testing. 

Model (2.1) ignores the fact that investigators usu- 
ally begin with hot prospects in mind, genes that 
have high prior probability of being interesting. Sup- 
pose po(i) is the prior probability that gene i is null, 
and define po as the average of po(i) over all N 
genes. Then Bayes' theorem yields this expression 
for idvi(z) = Prjgenej null]^ = z}\ 



fdrj(z) = fdr(z) 



(2.15) 



1 - (1 - n)idr(z) 

Po(i) I Po 



l-Po(i)' 1-Po 



where fdr (2) = Pofo(z)/ f(z) as before. So for a hot 
prospect having po(i) = 0.50 rather than po = 0.90, 
(2.15) changes an uninteresting result like fdr(zj) = 
0.40 into fdr;(z;) = 0.069. 

Wonderfully neat and exact results like the Benjamini- 
Hochberg Fdr control rule exert a powerful influence 
on statistical theory, sometimes more than is good 
for applied work. Much of the microarray statistics 
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literature seems to me to be overly concerned with 
exact properties borrowed from classical test the- 
ory, at the expense of ignoring the complications 
of large-scale testing. Neatness and exactness are 
mostly missing in what follows as I examine an em- 
pirical Bayes approach to the application of two- 
groups /Fdr ideas to situations like those in Figure 1 . 

3. EMPIRICAL BAYES METHODS 

In practice, the difference between Bayesian and 
frequentist statisticians is their self-confidence in as- 
signing prior distributions to complicated probabil- 
ity models. Large-scale testing problems certainly 
look complicated enough, but this is deceptive; their 
massively parallel structure, with thousands of sim- 
ilar situations each providing information, allows an 
appropriate prior distribution to be estimated from 
the data without upsetting even timid frequentists 
like myself. This is the empirical Bayes approach of 
Robbins and Stein, 50 years old but coming into its 
own in the microarray era; see Efron (2003). 

Consider estimating the local false discovery rate 
fidr(z) =Pofo(z)/f(z), (2.7). I will begin with a "good" 
case, like the prostate data of Example A in Section 
1, where it is easy to believe in the theoretical null 
distribution (1.1), 

(3.1) /oW^^E-le-^. 

The z-values in Example A were obtained by trans- 
forming the usual two-sample t statistic com- 
paring cancer and normal patients' expression levels 
for gene i, to a standard normal scale via 

(3.2) z i = $-\F 100 (t i )); 

here $ and Fiqq are the cdf's of standard normal 
and iioo distributions. If we had only gene z's data 
to test, classic theory would tell us to compare Zi 
with fo(z) = ip(z) as in (3.1). 

For the moment I will take po, the prior proba- 
bility of a gene being null, as known. Section 4 dis- 
cusses po's estimation, but in fact its exact value 
does not make much difference to Fdr(z) or fdr(z), 

(2.3) or (2.7), if po is near 1 as in (2.2). Benjamini 
and Hochberg (1995) take po = 1, providing an up- 
per bound for Fdr(z). 

This leaves us with only the denominator f(z) to 
estimate in (2.7). By definition (2.6), f(z) is the 
marginal density of all N Zj's, so we can use all the 
data to estimate f(z). The algorithm locfdr, an R 



function available from the CRAN library, does this 
by means of standard Poisson GLM software (Efron, 
2005). Suppose the z-values have been binned, giv- 
ing bin counts 

(3.3) y k = #{zi in bin/c}, k = l,2,...,K. 

The prostate data histogram in panel A of Figure 1 
has K = 49 bins of width A = 0.2. 
We take the y k to be independent Poisson counts, 

(3.4) 2/fc~ d P (i/ fe ), k = l,2,...,K, 

with the unknown u k proportional to density f(z) 
at midpoint "xfc" of the kth bin, approximately 

(3.5) u k = NAf(x k ). 

Modeling log(z^) as a pth-degree polynomial func- 
tion of x k makes (3.4)-(3.5) a standard Poisson gen- 
eral linear model (GLM). The choice p = 7 used in 
Figure 3 amounts to estimating f{z) by maximum 
likelihood within the seven-parameter exponential 
family 

(3.6) f(z) = exp\j2PjA- 

lj=o ) 

Notice that p = 2 would make f(z) normal; the ex- 
tra parameters in (3.6) allow flexibility in fitting 
the tails of f(z). Here we are employing Lindsey' 's 
method; see Efron and Tibshirani (1996). Despite 
its unorthodox look, it is no more than a conve- 
nient way to obtain maximum likelihood estimates 
in multiparameter families like (3.6). 

The heavy curve in Figure 3 is an estimate of the 
local false discovery rate for the prostate data, 

(3.7) fdr(z)=p f (z)/f(z), 

with f(z) constructed as above, fo(z) = (p(z) as in 
(3.1), and po = 0.93, as estimated in Section 4; fdr(z) 
is near 1 for \z\ < 2, decreasing to interesting levels 
for \z\ > 3. Fifty-one of the 6033 genes have fdr(zj) < 
0.2, 26 on the right and 25 on the left, and these 
could be reported back to the investigators as likely 
nonnull candidates. [The standard Benjamini- 
Hochberg procedure, (2.5) with q = 0.1, reports 60 
nonnull genes, 28 on the right and 32 on the left.] 

At this point the reader might notice an anomaly: 
if po = 0.93 of the N genes are null, then about 
(1 — po) ■ 6033 = 422 should be nonnull, but only 51 
are reported. The trouble is that most of the non- 
null genes are located in regions of the z axis where 
fdr(zj) exceeds 0.5, and these cannot be reported 
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without also reporting a bevy of null cases. In other 
words, the prostate study is underpowered. 

The vertical bars in Figure 3 are estimates of the 
nonnull counts, the histogram we would see if only 
the nonnull genes provided z- values. In terms of 
(3.3), (3.7), the nonnull counts u y k " are 

(3.8) = [1 - fdr k ]y k , 

where fdv k = ldr(x k ), the estimated fdr value at the 
center of bin k. Since 1 — idr k approximates the non- 
null probability for a gene in bin k, formula (3.8) is 
an obvious estimate for the expected number of non- 
nulls. 

Power diagnostics are obtained from comparisons 
of fdr (2) with the nonnull histogram. High power 

would be indicated if fdr^ was small where y^ was 
large. That obviously is not the case in Figure 3. A 
simple power diagnostic is 

(3.9) ^ {1) = f>i^r fc /f>« 

k=l k=l 

the expected nonnull fdr. We want E fdr to be 
small, perhaps near 0.2, so that a typical nonnull 
gene will show up on a list of likely prospects. The 



prostate data has E fdr = 0.68, indicating low power. 
If the whole study were rerun, we could expect a 
different list of 50 likely nonnull genes, barely over- 
lapping with the first list. Section 3 of Efron (2006) 
discusses power calculations for microarray studies, 
presenting more elaborate power diagnostics. 

Stripped of technicalities, the idea underlying false 
discovery rates is appealingly simple, and in fact 
does not depend on the literal validity of the two- 
groups model (2.1). Consider the bin z% £ [3.1,3.3] in 
the prostate data histogram; 17 of the 6033 genes fall 
into this bin, compared to expected number 2.68 = 
Po-^Ay(3.2) of null genes, giving 

(3.10) Mr = 2.68/17 = 0.16 

as an estimated false discovery rate. (The smoothed 
estimate in Figure 3 is fdr = 0.24.) The implication 
is that only about one-sixth of the 17 are null genes. 
This conclusion can be sharpened, as in Lehmann 
and Romano (2005), but (3.10) catches the main 
idea. 

Notice that we do not need all the null genes to 
have the same density fo(z); it is enough to as- 
sume that the average null density is fo(z), ip(z) in 
this case, in order to calculate the numerator 2.68. 
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Table 1 _ 

Boldface, standard errors of logfdr(^), (local fdr), and 
logFdr(2:), (tail-area), 250 replications of model (3.11), 
iV = 1500. Parentheses, average from formula (5.9), Efron 
(2006); fdr is true value (2.7). Empirical Null results 
explained in Section 4 



Theoretical null 



Empirical null 



z 


fdr 


local 


(formula) 


tail 


local 


(formula) 


tail 


1.5 


0.88 


0.05 


(0.05) 


0.05 


0.04 


(0.04) 


0.10 


2.0 


0.69 


0.08 


(0.09) 


0.05 


0.09 


(0.10) 


0.15 


2.5 


0.38 


0.09 


(0.10) 


0.05 


0.16 


(0.16) 


0.23 


3.0 


0.12 


0.08 


(0.10) 


0.06 


0.25 


(0.25) 


0.32 


3.5 


0.03 


0.10 


(0.13) 


0.07 


0.38 


(0.38) 


0.42 


4.0 


0.005 


0.11 


(0.15) 


0.10 


0.50 


(0.51) 


0.52 



(This is an advantage of false discovery rate meth- 
ods, which only control rates, not individual proba- 
bilities.) The nonnull density fi(z) in (2.1) plays no 
role at all since the denominator 17 is an observed 
quantity. Exchangeability is the key assumption in 
interpreting (3.10): we expect about 1/6 of the 17 
genes to be null, and assign posterior null probabil- 
ity 1/6 to all 17. Nonexchangeability, in the form 
of differing prior information among the 17, can be 
incorporated as in (2.15). 

Density estimation has a reputation for difficulty, 
well-deserved in general situations. However, there 
are good theoretical reasons, presented in Section 
6 of Efron (2005), for believing that mixtures of 
z-values are quite smooth, and that (3.7) will ef- 
ficiently estimate fdr(z). Independence of the z^s is 
not required, only that f(z) is a reasonably close 
estimate of f(z). 

Table 1 reports on a small simulation study in 
which 



(3.11) 



Zi *~ N(m 



i) 



Mi = o, 

with probability 0.9, 
/ii~JV(3,l), 

with probability 0.1, 



for i = 1, 2, . . . , N = 1500. The table shows standard 
deviations for log(fdr(z)), (3.7), from 250 simula- 
tions of (3.11), and also using a delta- method for- 
mula derived in Section 5 of Efron (2006), incor- 
porated in the locfdr algorithm. Rather than (3.6), 
f(z) was modeled by a seven-parameter natural spline 
basis, locfdr^s default, though this gave nearly the 
same results as (3.6). Also shown are standard devia- 
tions for the corresponding tail-area quantity 
log(Fdr(z)) obtained by substituting F(z) = 



jl OQ f{z')dz l in (2.3). [This is a little less variable 
than using F(z), (2.4).] 

The "Theoretical Null" side of the table shows 
that fdr(z) is more variable than Fdr(z), but both 
are more than accurate enough for practical use. At 
z = 3, for example, fdr(z) only errs by about 8%, 
yielding fdr(z) = 0.12 ± 0.01. Standard errors are 
roughly proportional to N~ 1 / 2 , so even reducing N 
to 250 gives fdr(3) = 0.12 ± .025, and similarly for 
other values of z, accurate enough to make pictures 
like Figure 3 believable. 

Empirical Bayes is a bipolar methodology, with al- 
ternating episodes of frequentist and Bayesian activ- 
ity. Frequentists may prefer Fdr [or Fdr, (2.5)] to fdr 
because of connections with classical tail-area hy- 
pothesis testing, or because cdf 's are more straight- 
forward to estimate than densities, while Bayesians 
prefer fdr for its more apt a posteriori interpretation. 
Both, though, combine the Bayesian two-groups model 
with frequentist estimation methods, and deliver the 
same basic information. 

A variety of local fdr estimation methods have 
been suggested, using parametric, semiparametric, 
nonparametric and Bayes methods: Pan et al. (2003), 
Pounds and Morris (2003), Allison et al. (2002), 
Heller and Qing (2003), Broberg (2005), Aubert et 
al. (2004), Liao et al. (2004) and Do et al. (2005), 
all performing reasonably well. The Poisson GLM 
methodology of locfdr has the advantage of easy im- 
plementation with familiar software, and a closed- 
form error analysis. 

Estimation efficiency becomes a more serious prob- 
lem on the "Empirical Null" side of Table 1, where 
we can no longer trust the theoretical null fo(z) ~ 
N(Q, 1). This is the subject of Section 4. 

4. THE EMPIRICAL NULL DISTRIBUTION 

We have been assuming that fo(z), the null den- 
sity in (2.1), is known on theoretical grounds, as in 
(3.1). This leads to false discovery estimates such as 
£v(z)=p f (z)/f(z) and Fdr(z) = Po F (z)/F(z), 
where only denominators need be estimated. Most 
applications of Benjamini and Hochberg's control 
algorithm (2.5) make the same assumption (some- 
times augmented with permutation calculations, 
which usually produce only minor corrections to the 
theoretical null, as discussed in Section 5). Use of the 
theoretical null is mandatory in classic one-at-a-time 
testing, where theory provides the only information 
available for null behavior. But things change in 
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Fig. 4. z -values from two microarray studies. BRCA data (Hedenfalk et al, 2001), comparing seven breast cancer patients 
having BRC'Al mutation to eight with BRCA2 mutation iV = 3226 genes. HIV data (van't Wout et al., 2003) comparing four 
HIV+ males with four HIV— males, N = 7680 genes. Theoretical N(0, 1) null, heavy curve is too narrow for BRCA data, too 
wide for HIV data. Light curves are empirical nulls: normal densities fit to the central histogram counts. 



large-scale simultaneous testing situations: serious 
defects in the theoretical null may become obvious, 
while empirical Bayes methods can provide more re- 
alistic null distributions. 

Figure 4 shows z-value histograms for two addi- 
tional microarray studies, described more fully in 
Efron (2006). These are of the same form as the 
prostate data: n subjects in two disease categories 
provide expression levels for N genes; two-sample t- 
statistics ti comparing the categories are computed 
for each gene, and then transformed to z-values z% = 
$- 1 (F n _ 2 {ti)), as in (3.2). Unlike panel A of Figure 
1, however, neither histogram obeys the theoretical 
iV(0, 1) null near z = 0. The BRCA data has a much 
wider central peak, while the HIV peak is too nar- 
row. The lighter curves in Figure 4 are empirical null 
estimates (Efron, 2004), normal curves fit to the cen- 
tral peak of the z-value histograms. The idea here 
is simple enough: we make the "zero assumption," 

Zero assumption. 

Most of the z-values near 

(4.1) 

come from null genes, 



(discussed further below), generalize the iV(0, 1) the- 
oretical null to A^(5o,ctq), and estimate (<5o,0"q) from 
the histogram counts near z = 0. Locfdr uses two dif- 
ferent estimation methods, analytical and geomet- 
ric, described next. 

Figure 5 shows the geometric method in action on 
the HIV data. The heavy solid curve is log f(z), fit 
from (3.6) using Lindsey's method, as described in 
Efron and Tibshirani (1996). The two-groups model 
and the zero assumption suggest that if /o is nor- 
mal, f(z) should be well-approximated near z = 
by poips ,a {z), with 

(4.2) tp 5om (z) = (2^)"^ exp{-i(^) 2 }, 
making log/(z) approximately quadratic, 

log f(z) = logpo - \\% +log(27rcT 2 )| 

(4.3) ^ 

<5o 1_ 2 

The beaded curve shows the best quadratic approx- 
imation to log/(z) near 0. Matching its coefficients 
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z-value 

Fig. 5. Geometric estimate of null proportion po and empirical null mean and standard deviation (5o,f"o) for the HIV data. 
Heavy curve is log f(z), estimated as in (3.3)-(3.6); beaded curve is best quadratic approximation to log/(z) near z — 0. 



(Po, f3\, P2) to (4.3) yields estimates (So, do,po), for 
instance, ctq = (2/32) -1 / 2 , 



(4.4) 



So 
&o 
Po 



-0.107, 

0.753, 

0.931, 



for the HIV data. Trying the same method with the 
theoretical null, that is, taking (So,ao) = (0,1) in 
(4.3), gives a very poor fit, and po equals the impos- 
sible value 1.20. 

The analytic method makes more explicit use of 
the zero assumption, stipulating that the nonnull 
density fi(z) in the two-groups model (2.1) is sup- 
ported outside some given interval [a, b] containing 
zero (actually chosen by preliminary calculations). 
Let No be the number of Zi in [a, b], and define 

'b — 5o\ ^( a ~ ^0 



Po(So,(T ) = * 



(4.5) 



00 



and 



0=^0- 



Then the likelihood function for zo, the vector of No 
z- values in [a, b] , is 

fs ^ PO (zo) = [e No (i-e) N - No ] 



(4.6) 



M Po(S ,a ) 



This is the product of two exponential family like- 
lihoods, which is numerically easy to solve for the 
maximum likelihood estimates (So, do, po), equaling 
(-0.120,0.787,0.956) for the HIV data. 

Both methods are implemented in locfdr. The an- 
alytic method is somewhat more stable but can be 
more biased than geometric fitting. Efron (2004) 
shows that geometric fitting gives nearly unbiased 
estimates of So and <ro for po > 0.90. Table 2 shows 
how the two methods fared in the simulation study 
of Table 1. 

A healthy literature has sprung up on the estima- 
tion of po, as in Pawitan et al. (2005) and Langlass 
et al. (2005), all of which assumes the validity of 
the theoretical null. The zero assumption plays a 

Table 2 

Comparison of estimates (5o,o"o,po), simulation study of 
Table 1. "Formula" is average from delta-method standard 
deviation formulas, Section 5 in Efron (2006), as 
implemented in locfdr 







Geometric 




Analytic 


mean 


stdev 


(formula) 


mean 


stdev 


(formula) 


So: 


0.02 


0.056 


(0.062) 


0.04 


0.031 


(0.032) 


o : 


1.02 


0.029 


(0.033) 


1.04 


0.031 


(0.031) 


Po: 


0.92 


0.013 


(0.015) 


0.93 


0.009 


(0.011) 
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central role in this literature [which mostly works 
with two-sided p- values rather than z-values, e.g., 
Pi = 2(1 — -Fioo(|ti|)) in (3.2), making the "zero re- 
gion" occur near p = 1]. The two-groups model is 
unidentifiable if /o is unspecified in (2.1), since we 
can redefine /o as /o + cf\ , and p\ as p\ — cpo for any 
c<pi/po- With pi small, (2.2), and /i supposed to 
yield z^s far from for the most part, the zero as- 
sumption is a reasonable way to impose identifiabil- 
ity on the two-groups model. Section 6 considers the 
meaning of the null density more carefully, among 
other things explaining the upward bias of po seen 
in Table 2. 

The empirical null is an expensive luxury from 
the point of view of estimation efficiency. Compar- 
ing the right-hand side of Table 1 with the left re- 
veals factors of 2 or 3 increase in standard error rel- 
ative to the theoretical null, near the crucial point 
where fdr(z) = 0.2. Section 4 of Efron (2005) pins 
the increased variability entirely on the estimation 
of (<5o,0"o); even knowing the true values of po and 
f(z) would reduce the standard error of logfdr(z) 
by less than 1%. (Using tail-area Fdr's rather than 
local fdr's does not help — here the local version is 
less variable.) 

The reason for considering empirical nulls is that 
the theoretical N(0, 1) null does not seem to fit the 
data in situations like Figure 4. For the BRCA data 
we can see that the histogram is overdispersed com- 
pared to -/V(0, 1) around z = 0; the implication is 
that there will be more null counts far from zero 
than the theoretical null predicts, making iV(0, 1) 
false discovery rate calculations like (3.10) too opti- 
mistic. The opposite happens with the HIV data. 

There is a lot at stake here for both Bayesians 
and frequentists. Table 3 shows the number of gene 
discoveries identified by the standard Benjamini- 
Hochberg two-sided Fdr procedure, q = 0.10 in (2.5). 
The HIV results are much more dramatic using the 
empirical null fo(z) ~ ./V(— 0.11, 0.75 2 ) and in fact 
we will see in the next section that do = 0.75 is quite 
believable in this case. The BRCA data has been 
used in the microarray literature to compare anal- 
ysis techniques, under the presumption that better 
techniques will produce more discoveries; recently, 
for instance, in Storey et al. (2005) and Pawitan et 
al. (2005). Table 3 suggests caution in this interpre- 
tation, where using the empirical null negates any 
discoveries at all. 

The z- values in panel C of Figure 1, proteomics 
data, were calculated from standard Cox likelihood 



Table 3 

Number of genes identified as true discoveries by two-sided 
Benjamini-Hochberg procedure, 0.10 control level 





Theoretical null 


Empirical null 


BRCA data: 


107 





HIV data: 


22 


180 



Empirical null densities as in Figure 4. 



tests that should yield N(0, 1) null results asymp- 
totically. A N(-0.02,1.29 2 ) empirical null was ob- 
tained from the analytic method, resulting in only 
one peak with fdr < 0.2; using the theoretical null 
gave six such peaks. 

In panel B of Figure 1, the z- values were obtained 
from familiar binomial calculations, each Zj being 
calculated as 

z = (Pad - Pdis - A) 

( 4 - 7 ) 



Pad(l - Pad) Pdis(l -Pdis) 

"ad n dis 




where n a( j was the number of advantaged students in 
the high school, p a d the proportion passing the test, 
and likewise n^s and pdis for the disadvantaged stu- 
dents; A = 0.192 was the overall difference, median 
(Pad) ~ median (pais)- Here the empirical null stan- 
dard deviation So equals 1.52, half again bigger than 
the theoretical standard deviation we would use if 
we had only one school's data. An empirical null fdr 
analysis yielded 75 schools with fdr < 0.20, 30 on 
the left and 45 on the right. Example B is discussed 
a bit further in the next two sections, where its use 
in the two-groups model is questioned. 

My point here is not that the empirical null is 
always the correct choice. The opposite advice, al- 
ways use the theoretical null, has been inculcated by 
a century of classic one-case-at-a-time testing to the 
point where it is almost subliminal, but it exposes 
the statistician to obvious criticism in situations like 
the BRCA and HIV data. Large-scale simultaneous 
testing produces mass information of a Bayesian na- 
ture that impinges on individual decisions. The two- 
groups model helps bring this information to bear, 
after one decides on the proper choice of /o in (2.1). 
Section 5 discusses this choice, in the form of a list of 
reasons why the theoretical null, and its close friend 
the permutation null, might go astray. 
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5. THEORETICAL, PERMUTATION AND 
EMPIRICAL NULL DISTRIBUTIONS 

Like most statisticians, I have spent my profes- 
sional life happily testing hypotheses against theo- 
retical null distributions. It came as somewhat of a 
shock then, when pictures like Figure 4 suggested 
that the theoretical null might be more theoreti- 
cal than I had supposed. Once suspicious, it be- 
comes easy to think of reasons why fo(z), the cru- 
cial element in the two-groups model (2.1), might 
not obey classical guidelines. This section presents 
four reasons why the theoretical null might fail, and 
also gives me a chance to say something about the 
strengths and weaknesses of permutation null distri- 
butions. 

Reason 1 (Failed mathematical assumptions). 
The usual derivation of the null hypothesis distri- 
bution for a two-sample i-statistic assumes inde- 
pendent and identically distributed (i.i.d.) normal 
components. For the BRCA data of Figure 4, di- 
rect inspection of the 3226 by 15 matrix li X" of 
expression values reveals markedly nonnormal com- 
ponents, skewed to the right (even after the columns 
of X have been standardized to mean and stan- 
dard deviation 1, as in all my examples here). Is this 
causing the failure of the N(0, 1) theoretical null? 

Permutation techniques offer quick relief from such 
concerns. The columns of X are randomly permuted, 
giving a matrix X* with corresponding i-values t* 
and z-values z* = (F n -2(t*)) . This is done some 
large number of times, perhaps 100, and the empiri- 
cal distribution of the 100 ■ N z* 's used as a permuta- 
tion null. The well-known SAM algorithm (Tusher, 
Tibshirani and Chu, 2001) effectively employs the 
permutation null cdf in the numerator of the Fdr 
formula (2.3). 

Applied to the BRCA matrix, the permutation 
null came out nearly N(0,1) (as did simply simulat- 
ing the entries of X* by independent draws from all 
3226 • 15 entries of X), so nonnormal distributions 
were not the cause of BRCA's overwide histogram. 
In practice the permutation null usually approxi- 
mates the theoretical null closely, as a long history 
of research on the permutation t-test demonstrated; 
see Section 5.9 of Lehmann and Romano (2005). 

Reason 2 (Unobserved covariates). The BRCA 
study is observational rather than experimental — 
the 15 women were observed to be BRCA1 or BRCA2, 
not assigned, and likewise with the HIV and prostate 



studies. There are likely to be covariates — age, race, 
general health — that affect the microarray expres- 
sion levels differently for different genes. If these 
were known to us, they could be factored out using a 
separate linear model on each gene's data, providing 
a new and improved Zi obtained from the "Treat- 
ment" coefficient in the model. This would reduce 
the spread of the z-value histogram, perhaps even 
restoring the N(0, 1) theoretical null for the BRCA 
data. 

Unobserved covariates act to broaden the null dis- 
tribution fo(z). They also broaden the nonnull dis- 
tribution fi(z) in (2.1), and the mixture density 
f(z), but this does not correct fdr estimates like 
(3.10), where the numerator, which depends entirely 
on /o, is the only estimated quantity. Section 4 of 
Efron (2004) provides an analysis of a simplified 
model with unobserved covariates. Permutation tech- 
niques cannot recognize unobserved covariates, as 
the model demonstrates. 

Reason 3 (Correlation across arrays). False dis- 
covery rate methodology does not require indepen- 
dence among the test statistics Z{. However, the the- 
oretical null distribution does require independence 
of the expression values used to calculate each Zi\ 
in terms of the elements Xij of the expression ma- 
trix X, for gene i we need independence among 
xn , x i2 , • • • , Xi n in order to validate (1.1). 

Experimental difficulties can undercut across- 
microarray independence, while remaining undetect- 
able in a permutation analysis. This happened in 
both studies of Figure 4 (Efron, 2004, 2006). The 
BRCA data showed strong positive correlations among 
the first four BRCA2 arrays, and also among the last 
four. This reduces the effective degrees of freedom 
for each f-statistic below the nominal 13, making ti 
and Zi = <I>~ 1 (Fi3(tj)) overdispersed. 

Reason 4 (Correlation across genes). Benjamini 
and Hochberg's 1995 paper verified Fdr control for 
rule (2.5) under the assumption of independence 
among the iV z-values (relaxed a little in Benjamini 
and Yekutieli, 2001). This seems fatal for microar- 
ray applications since we expect genes to be corre- 
lated in their actions. A great virtue of the empirical 
Bayes/two-groups approach is that independence is 
not necessary; with Fdr(z) = pqFq(z) / F (z) , for in- 
stance, Fdr(z) can provide a reasonable estimate of 
Pr{null|Z < z} as long as F(z) is roughly unbiased 
for F(z) — in formal terms requiring consistency but 
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not independence — and likewise for the local version 
far(*)=po/ o (*0//(z), (3.7). 

There is, however, a black cloud inside the silver 
lining: the assumption that the null density fo(z) is 
known to the statistician. The empirical null esti- 
mation methods of Section 4 do not require z- value 
independence, and so disperse the black cloud, at 
the expense of increased variability in fdr estimates. 
Do we really need to use an empirical null? Efron 
(2007) discusses the following somewhat disconcert- 
ing result: even if the theoretical null distribution 
Zi ~ N(0, 1) holds exactly true for all null genes, 
Reasons 1-3 above not causing trouble, correlation 
among the Zi's can make the overall null distribu- 
tion effectively much wider or much narrower than 
JV(0,1). 

Microarray data sets tend to have substantial z- 
value correlations. Consider the BRCA data: there 
are more than five million correlations pij between 
pairs of gene z-values z% and zj; by examining the 
row-wise correlations in the X matrix we can esti- 
mate that the distribution of the pij's has approxi- 
mately mean and variance a 2 = 0.153 2 , 

(5.1) P~(0,a 2 ). 

(The zero mean is a consequence of standardizing 
the columns of X.) This is a lot of correlation — as 
much as if the BRCA genes occurred in 10 indepen- 
dent groups, but with common interclass correlation 
0.50 for all genes within a group. 

Section 3 of Efron (2006) shows that under as- 
sumptions (1.1)— (5.1), the ensemble of null-gene z- 
values will behave roughly as 

(5.2) Zi~N(0,o%) 
with 

(5.3) a^ = l + V2A, A~(0,a 2 ). 

If the variable A equaled a = 0.153, for instance, 
giving o"o = 1.10, then the expected number of null 
counts below z = —3 would be about poN$(— 3/1.10) 
rather than poN&(— 3), more than twice as many. 
There is even more correlation in the HIV data, 
a = 0.42, enough so that a moderately negative 
value of A could cause ctq = 0.75, as in Figure 4. 

The random variable A acts like an observable an- 
cillary in the two-groups situation — observable be- 
cause we can estimate ctq from the central counts of 
the z-value histogram, as in Section 4; uq is essen- 
tially the half-width of the central peak. 



Figure 6 is a cautionary story on the dangers of 
ignoring Bq. A simulation model with 

Zi~N(0,l), i = 1,2,..., 2700, and 

(5.4) 

m ~J\T(2.5,1.5), i = 2701,..., 3000, 

was run, in which the null Zj's, the first 2700, were 
correlated to the same degree as in the BRCA data, 
a = 0.153. For each of 1000 simulations of (5.4), 
a standard Benjamini-Hochberg Fdr analysis (2.5) 
(i.e., using the theoretical null for Fq) was run at 
control level q = 0.10, and used to identify a set of 
nonnull genes. 

Each of the thousand points in Figure 6 is (ao, Fdp), 
where ctq is half the distance between the 16th and 
86th percentiles of the 3000 Zj's, and Fdp is the 
"False discovery proportion," the proportion of iden- 
tified genes that were actually null. Fdp averaged 
0.091, close to the target value q = 0.10, but with a 
strong dependence on do: the lowest 5% of So's cor- 
responded to Fdp's averaging only 0.03, while the 
upper 5% average was 0.29, a factor of 9 difference. 

The point here is not that the claimed q- value 0.10 
is wrong, but that in any one simulation we may 
be able to see, from So, that it is probably mislead- 
ing. Using the empirical null counteracts this fallacy 
which, again, is not apparent from the permutation 
null. (Section 4 of Efron, 2007, discusses more elab- 
orate permutation methods that do bear on Figure 
6. See Qui et al., 2005, for a gloomier assessment of 
correlation effects in microarray analyses.) 

What is causing the overdispersion in the Edu- 
cation data of panel B, (4.7)? Correlation across 
schools, Reason 44, seems ruled out by the nature of 
the sampling, leaving Reasons 2 and 3 as likely can- 
didates; unobserved covariates are an obvious threat 
here, while within-school sampling dependences (Rea- 
son 3) are certainly possible. Fdr analysis yields 
eight times as many "significant" schools based on 
the theoretical null rather than /q ~ iV(— 0.35, 1.51 2 ), 
but looks completely untrustworthy to me. 

Sometimes the theoretical null distribution is fine, 
of course. The prostate data had (8o,&o) = (0.00, 1.06) 
according to the analytic method of (4.6), close enough 
to (0, 1) to make theoretical null calculations believ- 
able. However, there are lots of things that can go 
wrong with the theoretical null, and lots of data to 
check it with in large-scale testing situations, mak- 
ing it a matter of due diligence for the statistician to 
do such checking, even if only by visual inspection 
of the z-value histogram. All simultaneous testing 
procedures, not just false discovery rates, go wrong 
if the null distribution is misrepresented. 
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Fig. 6. Benjamini-Hochberg Fdr control procedure (2.5), q = 0.1, run for 1000 simulations of correlated model (5.4); true 
false discovery proportion Fdp plotted versus half-width estimate oq. Overall Fdp averaged 0.091, close to q, but with a strong 
dependence on ero • 



6. A ONE-GROUP MODEL 

Classical one-at-a-time hypothesis testing depends 
on having a unique null density fo(z), such as Stu- 
dent's t distribution for the normal two-sample sit- 
uation. The assumption of unique /o has been car- 
ried over into most of the microarray testing litera- 
ture, including our definition (2.1) of the two-groups 
model. 

Realistic examples of large-scale inference are apt 
to be less clearcut, with true effect sizes ranging con- 
tinuously from zero or near zero to very large. Here 
we consider a "one-group" structural model that al- 
lows for a range of effects. We can still usefully apply 
fdr methods to data from one-group models; doing 
so helps clarify the choice between theoretical and 
empirical null hypotheses, and explicates the biases 
inherent in model (2.1). The discussion in this sec- 
tion, as in Section 2, will be mostly theoretical, in- 
volving probability models rather than collections of 
observed z- values. 

Model (2.1) does not require knowing how the 
z-values were generated, a substantial practical ad- 
vantage of the two-groups formulation. In contrast, 



one-group analysis begins with a specific Bayesian 
structural model. We assume that the ith case has 
an unobserved true value distributed according 
to some density g(fJ,), and that the observed Z\ is 
normally distributed around fa, 

(6.1) t^^gi') and z\fx ~ N(fj,, 1). 

The density g(ft) is allowed to have discrete atoms. 
It might have an atom at zero but this is not re- 
quired, and in any case there is no a priori partition 
of g(fJ>) into null and nonnull components. 

As an example, suppose g(fj,) is a mixture of 90% 
iV(0,0.5 2 ) and 10% JV(2.5, 0.5 2 ), 

(6.2) g{p) = 0.9 • <po,o.5(A0 + 0- 1 ■ ¥>2.5,o.s(a0 

in notation (4.2). The histogram in Figure 7 shows 
N = 3000 draws of fa from (6.2). I am thinking of 
this as a situation having a large proportion of un- 
interesting cases centered near, but not exactly at, 
zero, and a small proportion of interesting cases cen- 
tered far to the right. We still want to use the ob- 
served Zi's from (6.2) to flag cases that are likely to 
be interesting. 
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Fig. 7. Le/t panel: Histogram shows N = 3000 draws of fa from model (6.2); smooth curve is corresponding density f(z), 
(6.3). Right panel: u Emp null" is fdr(z) based on empirical null; it closely matches full Bayes posterior probability "Bayes" 
= Pr{/ifc < 1.5|z} from (6.1)-(6.2); "Theo null" is fdr(z) based on theoretical null, a poor match to Bayes. 



The density of z in model (6.1) is 

/oo 
fit* - z)g(n) dfx, 
-oo 

(6.3) 

[<p(x) =exp(-x 2 /2)/V2^], 
shown as the smooth curve in the left-hand panel, 

(6.4) f(z) = 0.9 • <po,i.u(z) + 0- 1 • <P2.5,i.M- 

The effect of noise in going from fii to Zi ~ N((Mi, 1) 
has blurred the strongly bimodal /x-histogram into 
a smoothly unimodal f{z). 

We can still employ the tactic of Figure 5, fitting a 
quadratic curve to log f(z) around z = to estimate 
Pq and the empirical null density fo(z). Using the 
formulas described later in this section gives 

(6.5) p = 0.93 and f (z) ~ N(.02, 1.14 2 ), 

and corresponding fdr curve pofo(z)/ f(z), labeled 
"Emp null" in the right-hand panel of Figure 7. 

Looking at the histogram, it is reasonable to con- 
sider "interesting" those cases with /ij > 1.5, and 
"uninteresting" /ij < 1.5. The curve labeled "Bayes" 
in Figure 7 is the posterior probability 



Pr{ uninteresting|z} based on full knowledge of (6.1), 
(6.2). The empirical null fdr curve provides an ex- 
cellent estimate of the full Bayes result, without the 
prior knowledge. [An fdr based on the theoretical 
A^O, 1) null is seen to be far off.] 

Unobserved covariates, Reason 2 in Section 4, can 
easily produce blurry null hypotheses like that in 
(6.2). My point here is that the two-group model 
will handle blurry situations if the null hypothesis is 
empirically estimated. Or, to put things negatively, 
theoretical or permutation null methods are prone 
to error in such situations, no matter what kind of 
analysis technique is used. 

Comparing (6.5) with (6.4) shows that fo(z) is 
just about right, but po is substantially larger than 
the value 0.90 we might expect. The f2.5,.5 com- 
ponent of g(n) puts some of its z-values near zero, 
weakening the zero assumption (4.1) and biasing po 
upward. The same thing happened in Table 2 even 
though model (3.11) is "unblurred," g(fi) having a 
point mass at fj, = 0. Fortunately, po is the least im- 
portant part of the two-groups model for estimat- 
ing fdr(z), under assumption (2.2). "Bias" can be a 
misleading term in model (6.1) since it presupposes 
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that each \Xi is clearly defined as null or nonnull. 
This seems clear enough in (3.11). The null/nonnull 
distinction is less clear in (6.2), though it still makes 
sense to search for cases that have /i, unusually far 
from 0. 

The results in (6.5) come from a theoretical anal- 
ysis of model (6.1). The idea in what follows is to 
generalize the construction in Figure 5 by approxi- 
mating £(z) = log/ (2) with Taylor series other than 
quadratic. 

The Jth Taylor approximation to £{z) is 



(6.6) 



3=0 



where £^°\0) = log/(0) and for j > 1 



(6.7) 



dnogf(z) 



dzi 



z=0 



Let fo(z) indicate the subdensity Pofo( z ), the nu- 
merator of fdr(z) in (2.7). The choice 



(6.8) 



matches f{z) at z = (a convenient form of the zero 
assumption) and leads to an fdr expression 



(6.9) 



fdr(z) 



J A*) 



//(*)■ 



Larger choices of J match fo(z) more accurately to 
f(z), increasing ratio (6.9); the interesting z-values, 
those with small fdr's, are pushed farther away from 
zero as we allow more of the data structure to be 
explained by the null density. 

Bayesian model (6.1) provides a helpful interpre- 
tation of the derivatives £^(0): 

Lemma. The derivative £^(0) , (6.7), is the jth 
cumulant of the posterior distribution of fj, given z = 

0, except that £^ 2 \0) is the second cumulant minus 

1. Thus 



(6.10) 



*«(0) 
-^ 2 )(0) 



Eq and 
l-V = %, 



where Eq and Vq are the posterior mean and vari- 
ance of \jl given z = 0. 

Proof of the lemma appears in Section 7 of Efron 
(2005). 

For J = 0,1,2, formulas (6.8), (6.9) yield simple 
expressions for po and fo(z) in terms of /(0), Eq 



Table 4 

Expressions for po, fo and fdr, first three choices of J in 
(6.8), (6.9); Vo = 1 — Vo; J = gives theoretical null, J = 2 
empirical null; f(z) from (6.3) 



J 





1 


2 


po 
/oW 
fdr(z) 
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#(0,1) 
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/(«) 



and Vo- These are summarized in Table 4, with pq 
obtained from 



(6.11) 



Po 



fo(z)dz. 



Formulas are also available for Fdr (2), (2.8). 

The choices J = 0, 1, 2 in Table 4 result in a nor- 
mal null density fo(z), the only difference being the 
means and variances. Going to J = 3 allows for an 
asymmetric choice of fo(z), 



(6.12) fdr(z) 



/(*)' 



,E z-V z 2 /2+S z 3 /6 



where So is the posterior third central moment of \i 
given z = in model (6.1). The program loefdr uses 
a variant, the "split normal," to model asymmetric 
null densities, with the exponent of (6.12) replaced 
by a quadratic spline in z. 

The lemma bears on the difference between em- 
pirical and theoretical nulls. Suppose that the prob- 
ability mass of g{ij) occurring within a few units 
of the origin is concentrated in an atom at ft = 0. 
Then the posterior mean and variance (Eq,Vq) of \x 
given z = will be near 0, making (Eq,Vo)=(0, 1). 
In this case the empirical null ( J = 2) will approx- 
imate the theoretical null (J = 0). Otherwise the 
two nulls differ; in particular, any mass of g(n) near 
zero increases Vo, swelling the standard deviation 
(1 — Vo)" 1 / 2 of the empirical null. 

The two-groups model (2.1), (2.2) puts one in a 
hypothesis-testing frame of mind: a large group of 
uninteresting cases is to be statistically separated 
from a small interesting group. Even blurry situa- 
tions like (6.2) exhibit a clear grouping, as in Figure 
7. None of this is necessary for the one-group model 
(6.1). We might, for example, suppose that g(fi) is 
normal, 



(6.13) 



H~N(A,B 2 
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and proceed in an empirical Bayes way to estimate 
A and B and then apply Bayes estimation to the 
individual cases. 

This line of thought leads directly to James-Stein 
estimation (Efron and Morris, 1975). Estimation, as 
opposed to testing, is the key word here — with pos- 
sible effect sizes Hi varying continuously rather than 
having a large clump of values near zero. The Edu- 
cation data of panel B, Figure 1, could reasonably 
be analyzed this way, instead of through simultane- 
ous testing. Scientific context, which says that there 
is likely to be a large group of (nearly) unaffected 
genes, as in (2.2), is what makes the two-groups 
model a reasonable Bayes prior for microarray stud- 
ies. 

7. BAYESIAN AND FREQUENTIST 
CONFIDENCE STATEMENTS 

False discovery rate methods provide a happy mar- 
riage between Bayesian and frequentist approaches 
to multiple testing, as shown in Section 2. Empiri- 
cal Bayes techniques based on the two-groups model 
seem to give us the best of both statistical philoso- 
phies. Things do not always work out so peaceably; 
in these next two sections I want to discuss con- 
tentious situations where the divorce court looms as 
a possibility. 

An insightful and ingenious paper by Benjamini 
and Yekutieli (2005) discusses the following problem 
in simultaneous significance testing: having applied 
false discovery rate methods to select a set of nonnull 
cases, how can confidence intervals be assigned to 
the true effect size for each selected case? (The paper 
and the ensuing discussion are much more general, 
but this is all I need for the illustration here.) 

Figure 8 concerns Benjamini and Yekutieli's so- 
lution applied to the following simulated data set: 
N = 10,000 (ni,Zi) pairs were generated as in (6.1), 
with 90% of the Hi zero, the null cases, and 10% 
distributed iV(-3,l), 

(7.1) g(p) = 0.90 • S (h) + 0-10 • <p-sM, 

So(/i) a delta function at h = 0. The Fdr procedure 
(2.5) was applied with q$ = 0.05, yielding 566 non- 
null "discoveries," those having m < —2.77. 

The Benjamini- Yekutieli "false coverage rate" 
(FCR) control procedure provides upper and lower 
bounds for the true effect size Hi corresponding to 
each Z{ less than —2.77; these are indicated by heavy 
diagonal lines in Figure 8, constructed as described 



in BY's Definition 1. This construction guarantees 
that the expected proportion of the 566 intervals 
not containing the true His the false coverage rate, 
is bounded by q = 0.05. 

In a real application only the Zi's and their BY 
confidence intervals could be seen, but in a simula- 
tion we can plot the actual (zi,Hi) pairs, and com- 
pare them to the intervals. Figure 8 plots (zi,Hi) for 
the 1000 nonnull cases, those from Hi ~ N(— 3, 1) 
in (7.1). Of these, 55,2 plotted as heavy points, lie 
to the left of z$ = —2.77, the Fdr threshold, with 
the other 448 plotted as light points; 14 null cases, 
fii = 0, plotted as "+," also had z% < zq. 

The first thing to notice is that the FCR property 
is satisfied: only 17 of the 566 intervals have failed to 
contain Hi (14 of these the +'s), giving 3% noncover- 
age. The second thing, though, is that the intervals 
are frighteningly wide — Zj ± 2.77, about y/2 longer 
than the usual individual 95% intervals Zi ± 1.96 — 
and poorly centered, particularly at the left where 
all the /j,i's fall in their intervals' upper halves. 

An interesting comparison is with Bayes' rule ap- 
plied to (6.1), (7.1), which yields 

(7.2) Pr{/i = 0|z i } = fdr(z i ), 

where 

fdr(z) = 0.9 -tpo^z) 

(7-3) 

• [0.9 ^ ,iW + 0.1 ■<p_ 3tV 2(z)]- 1 

as in (2.7), and 

(7.4) 0(^1^0,*) 

That is, fit is null with probability fdr(zj), and N((zi — 
3)/2, 1/2) with probability 1 — fdr(zj). The dashed 
lines indicate the posterior 95% intervals given that 
Hi is nonnull, {z,i — 3)/2 ± 1.96/v2, now y/2 shorter 
than the usual individual intervals; at the top of 
Figure 9 the beaded curve shows fdr(zj). 

The frequentist FCR intervals and the Bayes in- 
tervals are pursuing the same goal, to include the 
nonnull scores Hi with 95% probability. At Zi = — 2.77 
the FCR assessment is Pt{h € [-5.54,0]} = 0.95; 
Bayes' rule states that /Xj = with probability 
fdr(-2.77) = 0.25, and if Hi ± 0, then m G [-4.27, 
— 1.49] with probability 0.95. This kind of discon- 
nected description is natural to the two-groups model. 
A principal cause of FCR's oversized intervals (the 
paper shows that no FCR-controlling intervals can 
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Fig. 8. Benjamini-Yekutieli FCR controlling intervals applied to simulated sample of 10,000 cases from (6.1), (7.1). 566 
cases have z% < zo = —2.77, the Fdr (0.05) threshold. Plotted points are (zi,fii) for the 1000 nonnull cases; 14 null cases with 
Zi < zo indicated by "+." Heavy diagonal lines indicate FCR 95% interval limits; light lines are Bayes 95% posterior intervals 
given fa 7^ 0. Beaded curve at top is fdr(zi), posterior probability fa — 0. 
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Fig. 9. Computing a p-value for zs = 0.842, average of 15 z-values in CTL pathway, p53 data Solid histogram 500 row 
randomizations give p-value 0.002. Line histogram 500 column permutations give p-value 0.04-8. 
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be much narrower) comes from using a single con- 
nected set to describe a disconnected situation. 

Of course Bayes' rule will not be easily available 
to us in most practical problems. Is there an empir- 
ical Bayes solution? Part of the solution certainly is 
there: estimating fdr(z) as in Section 3. Estimating 
g(/j,i\[ii 7^ 0, Zi), (7.4), is more challenging. A straight- 
forward approach uses the nonnull counts (3.8) to 
estimate the nonnull density fi(z) in (2.1), decon- 
volves fi(z) to estimate the nonnull component 
"gi(fj,) n in (7.1), and applies Bayes' rule directly to 
<7x- This works reasonably well in Figure 8's exam- 
ple, but deconvolution calculations are notoriously 
tricky and I have not been able to produce a stable 
general algorithm. 

Good frequentist methods like the FCR proce- 
dure enjoy the considerable charm of an exact error 
bound, without requiring a priori specifications, and 
of course there is no law that they have to agree with 
any particular Bayesian analysis. In large-scale sit- 
uations, however, empirical Bayes information can 
overwhelm both frequentist and Bayesian predilec- 
tions, hopefully leading to a more satisfactory com- 
promise between the two sets of intervals appearing 
in Figure 8. 

8. IS A SET OF GENES ENRICHED? 

Microarray experiments, through a combination 
of insufficient data per gene and massively multiple 
simultaneous inference, often yield disappointing re- 
sults. In search of greater detection power, enrich- 
ment analysis considers the combined outcomes of 
biologically defined sets of genes, such as pathways. 
As a hypothetical example, if the 20 z- values in a 
certain pathway all were positive, we might infer sig- 
nificance to the pathway's effect, whether or not any 
of the individual z^s were deemed nonnull. 

Our example here will involve the p53 data, from 
Subramanian et al. (2005), N = 10,100 genes on 
n = 50 microarrays, Zj's as in (3.2), whose z- value 
histogram looks like a slightly short-tailed normal 
distribution having mean 0.04 and standard devia- 
tion 1.06. Fdr analysis (2.5), (7 = 0. 1, yielded just one 
nonnull gene, while enrichment analysis indicated 
seven or eight significant gene sets, as discussed at 
length in Efron and Tibshirani (2006). 

Figure 9 concerns the CTL pathway, a set of 15 
genes relating to the development of so-called killer 
T cells, #95 in a catalogue of 522 gene-sets provided 
by Subramanian et al. (2005). For a given gene-set 



"5" with m members, let % denote the mean of the 
m z-values within S; % is the enrichment statis- 
tic suggested in the Bioconductor R package limma 
(Smyth, 2004), 

(8.1) z s = 0.842 

for the CTL pathway. How significant is this result? I 
will consider assigning an individual p- value to (8.1), 
not taking into account multiple inference for a cat- 
alogue of possible gene-sets (which we could correct 
for later using Fdr methods, for instance, to combine 
the individual p- values). 

Limma computes p- values by "row randomization,' 
that is, by randomizing the order of rows of the 
N x n expression matrix X , and recomputing the 
statistic of interest. For a simple average like (8.1) 
this amounts to choosing random subsets of size 
m = 15 from the N = 10,100 z^s and comparing 
zs to the distribution of the randomized values Zg . 
Five hundred rowrands produced only one z$ > %, 
giving p- value 1/500 = 0.002. 

Subramanian et al. calculate p-values by permut- 
ing the columns of X rather than the rows. The per- 
mutations yield a much wider distribution than the 
row randomizations in Figure 9, with correspond- 
ing p- value 0.048. The reason is simple: the genes in 
the CTL pathway have highly correlated expression 
levels that increase the variance of z$; column- wise 
permutations of X preserve the correlations across 
genes, while row randomizations destroy them. 

At this point it looks like column permutations 
should always give the right answer. Wrong! For the 
BRCA data in Figure 4, the ensemble of z-values 
has (mean, standard deviation) about (0, 1.50), com- 
pared to (0,1) for z*'s from column permutations. 
This shrinks the permutation variability of Zg, com- 
pared to what one would get from a random selec- 
tion of genes for S, and can easily reverse the rela- 
tionship in Figure 9. 

The trouble here is that there are two obvious, but 
different, null hypotheses for testing enrichment: 

Randomization null hypothesis S has been chosen 
by random selection of m genes from the full set of 
N genes. 

Permutation null hypothesis The order of the n 
microarrays has been chosen at random with respect 
to the patient characteristics (e.g., with the patient 
being in the normal or cancer category in Example 
A of the Introduction). 
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Fig. 10. Enrichment analysis of Imaging data, panel D of Figure 1; z -value for original 15,445 voxels have been averaged 
over a gene-sets" of neighboring voxels with city-block distance < 2. Coded as "— " for Zi < 0, "+" for z% > 0; solid rectangles, 
labeled as "Enriched," show voxels with fdr(%) < 0.2, using empirical null. 



Efron and Tibshirani (2006) suggest a compro- 
mise method, restandardization, that to some de- 
gree accommodates both null hypotheses. Instead of 
permuting z$ in (8.1), restandardization permutes 
(zs — fi z )/cr z , where (fi z ,a z ) are the mean and stan- 
dard deviation of all N Zj's. Subramanian et al. do 
something similar using a Kolmogorov-Smirnov en- 
richment statistic. 

All of these methods are purely frequentistic. The- 
oretically we might consider applying the two-groups / 
empirical Bayes approach to sets of z- values "zs," 
just as we did for individual Zi's in Sections 2 and 
3. For at least three reasons that turns out to be 
extremely difficult: 

• My technique for estimating the mixture density 
/, as in (3.6), becomes exponentially more diffi- 
cult in higher dimensions. 

• There is not likely to be satisfactory theoretical 
null /o for the correlated components of z~s, while 
estimating an empirical null faces the same "curse 
of dimensionality" as for /. 

• As discussed following (3.10), false discovery rate 
interpretation depends on exchangeability, essen- 
tially an equal a priori interest in all N genes. 
There may be just one gene-set S of interest to 
an investigator, or a catalogue of several hundred 
5's as in Subramanian et al., but we certainly are 



not interested in all possible gene-sets. It would 
be a daunting exercise in subjective, as opposed 
to empirical, Bayesianism to assign prior proba- 
bilities to any particular gene-set S. 

Having said this, it turns out there is one "gene- 
set" situation where the two-groups /empirical Bayes 
approach is practical (though it does not involve 
genes). Looking at panel D of Figure 1, the Imag- 
ing data, the obvious spatial correlation among z- 
values suggests local averaging to reduce the effects 
of noise. 

This has been carried out in Figure 10: at voxel i 
of the N = 15,445 voxels, the average of z-values for 
those voxels within city-block distance 2 has been 
computed, say The results for the same hori- 
zontal slice as in panel D are shown using a similar 
symbol code. Now that we have a single number Z{ 
for each voxel, we can compute the empirical null 
fdr estimates as in Section 4. The voxels labeled "en- 
riched" in Figure 10 are those having fdr(zj) < 0.2. 

Enrichment analysis looks much more familiar in 
this example, being no more than local spatial smooth- 
ing. The convenient geometry of three-dimensional 
space has come to our rescue, which it emphatically 
fails to do in the microarray context. 
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9. CONCLUSION 

Three forces influence the state of statistical sci- 
ence at any one time: mathematics, computation 
and applications, by which I mean the type of prob- 
lems subject-area scientists bring to us for solution. 
The Fisher-Neyman-Pearson theory of hypothesis 
testing was fashioned for a scientific world where 
experimentation was slow and difficult, producing 
small data sets focused on answering single ques- 
tions. It was wonderfully successful within this mi- 
lieu, combining elegant mathematics and limited com- 
putational equipment to produce dependable answers 
in a wide variety of application areas. 

The three forces have changed relative intensi- 
ties recently. Computation has become literally mil- 
lions of times faster and more powerful, while scien- 
tific applications now spout data in fire-hose quanti- 
ties. (Mathematics, of course, is still mathematics.) 
Statistics is changing in response, as it moves to ac- 
commodate massive data sets that aim to answer 
thousands of questions simultaneously. Hypothesis 
testing is just one part of the story, but statistical 
history suggests that it could play a central role: 
its development in the first third of the twentieth 
century led directly to confidence intervals, decision 
theory and the flowering of mathematical statistics. 

I believe, or maybe just hope, that our new scien- 
tific environment will also inspire a new look at old 
philosophical questions. Neither Bayesians nor fre- 
quentists are immune to the pressures of scientific 
necessity. Lurking behind the specific methodology 
of this paper is the broader, still mainly unanswered, 
question of how one should combine evidence from 
thousands of parallel but not identical hypothesis 
testing situations. What I called "empirical Bayes 
information" accumulates in a way that is not well 
understood yet, but still has to be acknowledged: 
in the situations of Figure 4, the frequentist is not 
free to stick with classical null hypotheses, while the 
Bayesian cannot use prior (6.13), at least not with- 
out the risk of substantial inferential confusion. 

Classical statistics developed in a data-poor en- 
vironment, as Fisher's favorite description, "small- 
sample theory," suggests. By contrast, modern-day 
disciplines such as machine learning seem to struggle 
with the difficulties of too much data. Both prob- 
lems, too little and too much data, can afflict mi- 
croarray studies. Massive data sets like those in Fig- 
ure 1 are misleadingly comforting in their sugges- 
tion of great statistical accuracy. As I have tried to 



show here, the power to detect interesting specific 
cases, genes, may still be quite low. New methods 
are needed, perhaps along the lines of "enrichment," 
as well theory of experimental design explicitly 
fashioned for large-scale testing situations. 

One floor up from the philosophical basement lives 
the untidy family of statistical models. In this pa- 
per I have tried to minimize modeling decisions by 
working directly with z-values. The combination of 
the two-groups model and false discovery rates ap- 
plied to the z-value histogram is notably light on 
assumptions, more so when using an empirical null, 
which does not even require independence across the 
columns of X (i.e., across microarrays, a dangerous 
assumption as shown in Section 6 of Efron, 2004). 
There will certainly be situations when modeling in- 
side the X matrix, as in Newton et al. (2004) or 
Kerr, Martin and Churchill (2000), yields more in- 
formation than z-value procedures, but I will leave 
that for others to discuss. 
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